Here we explore the Red Wine Quality data set to find out which chemical properties influence the quality of red wines. The data set contains 1,599 red wines with 11 variables describing their chemical properties. Each wine was rated by at least 3 wine experts on a scale from 0 (very bad) to 10 (excellent).
These properties are:
Fixed Acidity: Most acids involved with wine are fixed or nonvolatile (they do not evaporate readily).
Volatile Acidity: The amount of acetic acid in wine, which at high levels can lead to an unpleasant vinegar taste.
Citric Acid: Found in small quantities, citric acid can add ‘freshness’ and flavor to wines.
Residual Sugar: The amount of sugar remaining after fermentation stops. It is rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet.
Chlorides: The amount of salt in the wine.
Free Sulfur Dioxide: The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.
Total Sulfur Dioxide: The amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.
Density: The density of wine is close to that of water, depending on the percent alcohol and sugar content.
pH: Describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.
Sulphates: A wine additive that can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.
Alcohol: The percent alcohol content of the wine.
In this section, we get a sense of the data, starting with wine quality and then exploring the distribution of each property.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
We start by looking at the distribution of quality. The quality ratings in our data set range from 3 to 8, so no wine sits at either extreme of the 0 (very bad) to 10 (excellent) scale.
To make these quality numbers more readable, we create a rating for each wine and group all wines into three groups as follows:

* 0 - 4: Bad
* 5 - 6: Average
* 7 - 10: Good

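The grouping above can be sketched in a few lines. This is a minimal illustration in Python, not the code used for the analysis: the function name `rate` is hypothetical, and the sample holds only the ten quality scores visible in the data preview, so its shares will not match those of the full data set.

```python
from collections import Counter

def rate(quality: int) -> str:
    """Map a 0-10 quality score to the three rating groups used in this report."""
    if quality <= 4:
        return "Bad"
    if quality <= 6:
        return "Average"
    return "Good"

# First ten quality scores from the data preview (illustrative sample only).
quality = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]

ratings = [rate(q) for q in quality]
counts = Counter(ratings)

# Share of each rating group in percent; missing groups count as zero.
shares = {label: 100 * counts[label] / len(ratings)
          for label in ("Bad", "Average", "Good")}
print(shares)
```

Keeping the labels in a fixed order (Bad, Average, Good) mirrors the idea of an ordered factor, so the groups can be plotted from worst to best.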
Expressed as percentages, as in the second figure, roughly 82% of red wines are rated average, less than 4% are considered bad, and a bit less than 15% are good.
Fixed acidity without outliers also seems to be right skewed, as if most wines tend to have a fixed acidity close to 7 but not less.
Volatile acidity does not follow any recognizable distribution.
The figure above shows several long spikes. Removing outliers would remove them, which would be inappropriate here because the result would no longer reflect the data.
Because of the long tail, we used a log10 transformation to see the distribution better. The distribution of residual sugar is still right skewed.
Again because of the long tail, we used a log10 transformation to see the distribution better. The distribution of chlorides forms a bell shape, which suggests a normal distribution.
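To illustrate why a log10 transformation helps with a long tail, here is a small sketch in Python. The `histogram` helper and the choice of five bins are assumptions made for the illustration, and the values are only the first ten residual sugar readings from the data preview, not the full data set.

```python
import math

def histogram(values, bins=5):
    """Count values in `bins` equal-width intervals between min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything lands in the first bin.
        return [len(values)] + [0] * (bins - 1)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum falls on the upper edge, so clamp it into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts

# First ten residual sugar values from the data preview (illustrative only).
residual_sugar = [1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1]
log_sugar = [math.log10(x) for x in residual_sugar]

raw_counts = histogram(residual_sugar)
log_counts = histogram(log_sugar)
print("raw  :", raw_counts)
print("log10:", log_counts)
```

On the raw scale a single large value stretches the axis and most observations pile into the first bin; on the log10 scale the same observations spread over more bins, which makes the shape easier to read.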
Here we see that the free sulfur dioxide is normally distributed.
The total sulfur dioxide is right skewed, with 2 very extreme outliers.
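The report does not say how outliers were identified. One common convention, assumed here purely for illustration, is Tukey's rule: a value is flagged when it lies more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies it to the total sulfur dioxide quartiles and maximum from the summary output above; `tukey_fences` is a hypothetical helper name.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of Tukey's rule for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles and maximum of total.sulfur.dioxide, taken from the summary output.
q1, q3, maximum = 22.0, 62.0, 289.0

lower, upper = tukey_fences(q1, q3)
print("fences:", lower, upper)
print("maximum is beyond the upper fence:", maximum > upper)
```

The lower fence is negative, which no concentration can reach, so under this rule only the right tail can produce outliers for this variable.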
The pH level is normally distributed.
Sulphates seem roughly normally distributed even after the log10 transformation.
Finally, alcohol is right skewed. It does not have a long tail, but it has some outliers and one extreme outlier.
The data set contains 1,599 samples of red wine with 12 variables. Eleven are chemical properties of each red wine, and the last one is the experts’ rating of its quality.
I would like to explore how the acids affect the taste and quality of the wine, and what effects sugar and alcohol have on it.
I think that fixed acidity, volatile acidity, residual sugar and alcohol play a major role in the quality of the wine.
I created the variable rating, which is an ordered factor, to classify each wine based on its quality as follows:

* 0 - 4: Bad
* 5 - 6: Average
* 7 - 10: Good

Most of the variables were right skewed, with only pH and density close to being normally distributed. The longest tails belong to residual sugar, chlorides, sulfur dioxide and sulphates.
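A quick way to cross-check these skewness impressions is to compare the mean with the median: in a right-skewed distribution the long tail usually pulls the mean above the median. The sketch below uses the quartiles, medians and means from the summary output above. The `skew_hint` score (the mean-median gap divided by the interquartile range) is a rough heuristic chosen for this illustration, not a formal skewness statistic.

```python
# (1st quartile, median, mean, 3rd quartile), copied from the summary output.
summary = {
    "residual.sugar":       (1.9,    2.2,    2.539,   2.6),
    "chlorides":            (0.07,   0.079,  0.08747, 0.09),
    "total.sulfur.dioxide": (22.0,   38.0,   46.47,   62.0),
    "sulphates":            (0.55,   0.62,   0.6581,  0.73),
    "pH":                   (3.21,   3.31,   3.311,   3.40),
    "density":              (0.9956, 0.9968, 0.9967,  0.9978),
}

def skew_hint(q1, median, mean, q3):
    """Mean-median gap in units of the interquartile range (positive: right tail)."""
    return (mean - median) / (q3 - q1)

hints = {name: skew_hint(*stats) for name, stats in summary.items()}

# List the variables from the most right-skewed hint to the least.
for name, hint in sorted(hints.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:22s} {hint:+.3f}")
```

Scores near zero, as for pH and density, are consistent with a roughly symmetric shape, while clearly positive scores point to a right tail.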
This section is divided into two parts. The first compares each property to quality, and the second takes pairs of properties and investigates their influence on each other.
Here we explore the 11 properties of the wine and how they affect its quality.
Fixed acidity does not show any clear relation with the quality.
Volatile acidity decreases as quality increases.
Citric acid does not show any clear relation with the quality.
Residual sugar does not show any clear relation with the quality.
Chlorides decrease as quality increases.
Free sulfur dioxide does not show any clear relation with the quality.
Total sulfur dioxide does not show any clear relation with the quality, but its range narrows when the quality is high.
Density shows a weak relation: it decreases with higher quality.
pH level does not show any clear relation with the quality.
Sulphates show a very clear relation with the quality: their amount increases with higher quality.
Alcohol percentage shows some relation with the quality: it is higher for higher quality.
After observing the figures above and limiting the y axis for some of them, we could see some relation between the quality and the following:
Positive Correlation:

* Sulphates
* Alcohol

Negative Correlation:

* Total Sulfur Dioxide
* Chlorides
* Volatile Acidity

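Comparisons like these come down to summarizing one property within each quality level, which is what a box plot per quality level shows. A minimal sketch of that idea in Python follows; `median_by_group` is a hypothetical helper, and the values are only the first ten rows of the data preview, so the output illustrates the mechanics rather than the trends reported above.

```python
from collections import defaultdict
from statistics import median

def median_by_group(values, groups):
    """Return the median of `values` within each distinct group label."""
    buckets = defaultdict(list)
    for value, group in zip(values, groups):
        buckets[group].append(value)
    return {group: median(vals) for group, vals in sorted(buckets.items())}

# First ten rows of the data preview (illustrative sample only).
alcohol = [9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5]
quality = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]

alcohol_medians = median_by_group(alcohol, quality)
print(alcohol_medians)
```

The median is used instead of the mean because several of the properties have long tails and outliers, which would pull a mean much more than a median.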
Next we look for relations between the different properties using the correlation matrix.
This gives us a sense of whether any two variables have an interesting relationship worth exploring further.
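For readers unfamiliar with a correlation matrix, each cell holds a correlation coefficient between two variables. The sketch below assumes the Pearson coefficient (the report does not state which one was used) and computes it for four columns using only the first ten rows from the data preview, so the numbers it prints are illustrative and will differ from the full-data matrix. The helper name `pearson` is hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sum(a * b for a, b in zip(dx, dy)) / denom

# First ten rows of four columns from the data preview (illustrative only).
sample = {
    "fixed.acidity":        [7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5],
    "citric.acid":          [0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36],
    "free.sulfur.dioxide":  [11, 25, 15, 17, 11, 13, 15, 15, 9, 17],
    "total.sulfur.dioxide": [34, 67, 54, 60, 34, 40, 59, 21, 18, 102],
}

# Every pair of columns, including each column with itself (the diagonal).
matrix = {a: {b: pearson(sample[a], sample[b]) for b in sample} for a in sample}

for a in sample:
    print(a.ljust(22), "  ".join(f"{matrix[a][b]:+.2f}" for b in sample))
```

The matrix is symmetric with ones on the diagonal, so only the cells above (or below) the diagonal carry new information.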
From this correlation matrix, the following pairs of variables seemed interesting:
We will explore them in three groups.
Since citric acid has a fairly large correlation with both fixed and volatile acidity, let’s explore these two relations.
We can observe from the above that the higher the fixed acidity, the higher the citric acid. This might mean that the amount of citric acid needed for the wine to taste fresh increases as the fixed acidity increases, or it might suggest that citric acid is a form of fixed acidity. On the other hand, citric acid’s relation with volatile acidity seems somewhat monotonically decreasing, though not strongly enough to confirm it.
Here we will check how the total sulfur dioxide and free sulfur dioxide are related.
This figure shows a strong positive correlation when the total sulfur dioxide is 50 or less; beyond that, the correlation starts getting weaker.
We know that the density of wine is primarily determined by the concentration of alcohol, sugar, and other dissolved solids, so we will explore how different properties affect the density.
We can observe that both fixed acidity and chlorides have a higher impact on the density than residual sugar: as residual sugar increases, the density remains almost the same, meaning that other components are contributing to the taste.
Here I would like to investigate how fixed acidity and volatile acidity affect the pH level of the wine.
As expected, the higher the concentration of acid, the more acidic the wine tends to be. In contrast, volatile acidity did not have any major impact on pH, which I did not expect.
We might conclude that the lower the density and the volatile acidity of the wine, the better its quality. In addition, the alcohol level increases with quality.
Citric acid’s correlation with fixed acidity is strong and positive, which may suggest that citric acid is a form of fixed acid, because the same kind of strong positive correlation appears between total and free sulfur dioxide.
The relationships that stood out were those between alcohol and wine quality, between citric acid and fixed acidity, and between total and free sulfur dioxide.
Motivated by the results of the last section and by other questions, we explore the data again, this time also observing how pairs of properties affect quality.
Starting with the acids again, we check how they affect quality and how they relate to each other.
The first plot shows that fixed acidity and citric acid do not have a strong relation with quality, while the second shows that citric acid combined with low volatile acidity produces better wine quality. The third plot gives the same result, again suggesting that citric acid is a form of fixed acidity, given their strong correlation.
Now let’s see how these two different tastes affect the density of the wine, and then how salt and sugar, sweetness and bitterness, affect each other.
The first plot suggests that chlorides have a negative correlation with wine quality. In the second plot, residual sugar does not appear to have any strong correlation with the density. The last figure does not show any interesting relation between chlorides and residual sugar.
Here we will see the effect of both alcohol and density on pH, and how all of this affects wine quality.
Again, alcohol does not change the pH here, but it shows the same relation with quality: the higher the alcohol, the better the quality. As for density vs pH, there is no clear, interesting pattern.
Were there features that strengthened each other in terms of looking at your feature(s) of interest?
The relation of volatile acidity and citric acid with quality was very obvious: the less volatile acid we have, the better the quality. As for pH, good-quality wines tend to have a higher pH at higher alcohol levels.
It was interesting how strong the effect of fixed acidity on pH is: as fixed acidity decreased, the pH quickly moved toward a more balanced level.
The pH figure here, after taking a smaller bin width, shows a normal distribution for data ranging between 3.0 and 3.6. Keeping in mind that the pH scale runs from 0 (acid) to 14 (base), with 7 (water) in the middle, we conclude that any type of wine, whatever elements contribute to its taste, should stay within this pH range.
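Because pH is a logarithmic scale (pH = -log10 of the hydrogen ion activity), the 3.0 to 3.6 range is wider than it looks. The short sketch below converts the two ends of that range back to concentrations; the function name is hypothetical, and treating activity as a concentration in mol/L is a simplifying assumption.

```python
def hydrogen_ion_concentration(ph):
    """Approximate hydrogen ion concentration (mol/L) implied by pH = -log10([H+])."""
    return 10 ** (-ph)

# Ends of the pH range discussed above.
low, high = 3.0, 3.6

ratio = hydrogen_ion_concentration(low) / hydrogen_ion_concentration(high)
print(f"pH {low} is about {ratio:.1f} times more acidic than pH {high}")
```

A difference of one full pH unit corresponds to a tenfold change in acidity, so even the narrow band that wines occupy covers a several-fold difference.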
After exploring the different elements that might affect the density, the fixed acidity figure was the most interesting and obvious one. The points form a strong positive correlation with the density. This suggests that the denser the wine, the more fixed acidity we should expect in it.
This figure shows that wine quality gets better with higher alcohol content. This increase in alcohol did not affect the pH level much, meaning that other elements played a role in keeping the wine at its average pH level.
It might sound weird, but since I have never tried any wine in my life, this project was like exploring and trying to imagine. Still, it was interesting to learn how all these elements affect the quality of the wine. You learn how to form questions and how to investigate data to get the answers you seek. As for the challenges, I struggled with how to explore this data because I did not know whether I was asking the right questions. I overcame this problem by reading more about wine and the effect of each of its elements. I think my analysis could be improved if the data documentation contained some figures and visualizations that demonstrate basic information about the wine elements.